Introduction

The goal of this final project is to build the most accurate model for predicting a person’s sleep efficiency score. We will implement several machine learning techniques to solve this regression problem, and hopefully gain a few new insights into improving our overall health through better-quality sleep.

What is Sleep Efficiency?

Sleep is more important than many of us are willing to admit, especially as university students who are learning to balance living on our own for the first time. Personally, I feel like there aren’t enough hours in a day to balance classes, studying, exercise, socialization, and downtime. Instead, I turn to chugging caffeine to sustain the long hours of cramming, and digging into a few hours of my precious beauty sleep to decompress with friends or enjoy some screen-time before bed. Most days, I wake up poorly rested and barely scrape through the duties of the day. I feel groggy and frustrated by my lethargy… Especially if I had gotten the recommended 7-9 hours the night before! I’m sure many of you can relate, whether you’re a student or not!

That’s why I investigated sleep efficiency: a measure of the proportion of time spent in bed that we’re actually asleep. To be precise, we define “time in bed” as the period beginning when we first try to fall asleep and ending when we wake up for the final time. Scientists consider a sleep efficiency (SE) score of 80% or above to be normal, and most young adults in good health exhibit an SE of at least 90%. Unfortunately, the research seems to suggest that sleep efficiency only declines with age.
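To make the definition concrete, here’s a tiny numeric sketch of the ratio (the numbers are made up for illustration, not taken from the study data):

```r
# Sleep efficiency = time asleep / time in bed (illustrative numbers only)
time_in_bed <- 8.0    # hours from first trying to sleep until the final wake-up
time_asleep <- 6.8    # hours actually spent asleep during that window
sleep_efficiency <- time_asleep / time_in_bed
round(sleep_efficiency, 2)   # 0.85, a "normal" score by the 80% threshold
```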

Data Description

Our aim is to predict the variable Sleep.efficiency: a measure of the proportion of the time in bed spent asleep. Its values may range over [0,1], although the lowest observed score within our data is 0.5. Sleep efficiency is a critical aspect of gauging sleep quality and overall well-being, since the measure accounts for periods of wakefulness and sleeplessness throughout the night. Our model will give us some insight into which variables improve or diminish our sleep efficiency, and to what degree. Sleep efficiency scores are also indicative of an individual’s overall health: the Hypersomnia Foundation claims that normal, healthy scores are 85% or higher.

Project Roadmap

For this Statistical Machine Learning project, we will be working with data from the file "sleep_efficiency.csv". Our goal is to build the best predictive model for the outcome Sleep.efficiency, and this large task requires a detailed step-by-step procedure. First, we’ll load our data and R packages, and try to gain an overall understanding of our raw data. Once we run a few lines of code, we can clean the data by removing the variables we don’t need and converting the format of our predictors so they’re ready for our recipe. We’ll split our data into training/testing sets, create 10 folds for cross-validation, and code a universal recipe shared by all five of our models: Linear Regression, Ridge Regression, K-Nearest Neighbors, Random Forests, and Gradient Boosted Trees. Once we train these models on our folded data, we can determine our best two models based on our chosen performance metric. The final step is to fit these two best models to our testing data to evaluate how well our best model can actually predict sleep efficiency scores. Are you ready for it?

Loading Packages and Data

First and foremost, let’s load all the necessary packages and the raw sleep-efficiency data into our workspace:

# Loading required packages 
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5     ✔ rsample      1.2.0
## ✔ dials        1.2.0     ✔ tune         1.1.2
## ✔ infer        1.0.5     ✔ workflows    1.1.3
## ✔ modeldata    1.3.0     ✔ workflowsets 1.0.1
## ✔ parsnip      1.1.1     ✔ yardstick    1.3.0
## ✔ recipes      1.0.9     
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(kknn)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(ggthemes)
library(recipes)
library(parsnip)
library(ranger)
library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(vip)
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
library(finalfit)
library(ISLR)
library(conflicted)
library(xgboost)

# Loading the CSV file
sleep <- read.csv("~/Desktop/sleep_efficiency.csv")

Exploratory Data Analysis

Exploring the Raw Data

Before delving into any further processing, let’s begin our initial exploration of the raw data obtained directly from our CSV file. It’s rare for a data set, especially one of this size, to arrive already “clean” and immediately ready for analysis. We will check for missing (NA) values to tidy up and evaluate whether we have to convert some of our variables into factors. Manipulating and cleaning the data is an essential preparatory step for later data visualizations and analysis.

# Briefly overview raw data
head(sleep)
##   ID Age Gender             Bedtime         Wakeup.time Sleep.duration
## 1  1  65 Female 2021-03-06 01:00:00 2021-03-06 07:00:00            6.0
## 2  2  69   Male 2021-12-05 02:00:00 2021-12-05 09:00:00            7.0
## 3  3  40 Female 2021-05-25 21:30:00 2021-05-25 05:30:00            8.0
## 4  4  40 Female 2021-11-03 02:30:00 2021-11-03 08:30:00            6.0
## 5  5  57   Male 2021-03-13 01:00:00 2021-03-13 09:00:00            8.0
## 6  6  36 Female 2021-07-01 21:00:00 2021-07-01 04:30:00            7.5
##   Sleep.efficiency REM.sleep.percentage Deep.sleep.percentage
## 1             0.88                   18                    70
## 2             0.66                   19                    28
## 3             0.89                   20                    70
## 4             0.51                   23                    25
## 5             0.76                   27                    55
## 6             0.90                   23                    60
##   Light.sleep.percentage Awakenings Caffeine.consumption Alcohol.consumption
## 1                     12          0                    0                   0
## 2                     53          3                    0                   3
## 3                     10          1                    0                   0
## 4                     52          3                   50                   5
## 5                     18          3                    0                   3
## 6                     17          0                   NA                   0
##   Smoking.status Exercise.frequency
## 1            Yes                  3
## 2            Yes                  3
## 3             No                  3
## 4            Yes                  1
## 5             No                  3
## 6             No                  1
summary(sleep)
##        ID             Age           Gender            Bedtime         
##  Min.   :  1.0   Min.   : 9.00   Length:452         Length:452        
##  1st Qu.:113.8   1st Qu.:29.00   Class :character   Class :character  
##  Median :226.5   Median :40.00   Mode  :character   Mode  :character  
##  Mean   :226.5   Mean   :40.29                                        
##  3rd Qu.:339.2   3rd Qu.:52.00                                        
##  Max.   :452.0   Max.   :69.00                                        
##                                                                       
##  Wakeup.time        Sleep.duration   Sleep.efficiency REM.sleep.percentage
##  Length:452         Min.   : 5.000   Min.   :0.5000   Min.   :15.00       
##  Class :character   1st Qu.: 7.000   1st Qu.:0.6975   1st Qu.:20.00       
##  Mode  :character   Median : 7.500   Median :0.8200   Median :22.00       
##                     Mean   : 7.466   Mean   :0.7889   Mean   :22.62       
##                     3rd Qu.: 8.000   3rd Qu.:0.9000   3rd Qu.:25.00       
##                     Max.   :10.000   Max.   :0.9900   Max.   :30.00       
##                                                                           
##  Deep.sleep.percentage Light.sleep.percentage   Awakenings   
##  Min.   :18.00         Min.   : 7.00          Min.   :0.000  
##  1st Qu.:48.25         1st Qu.:15.00          1st Qu.:1.000  
##  Median :58.00         Median :18.00          Median :1.000  
##  Mean   :52.82         Mean   :24.56          Mean   :1.641  
##  3rd Qu.:63.00         3rd Qu.:32.50          3rd Qu.:3.000  
##  Max.   :75.00         Max.   :63.00          Max.   :4.000  
##                                               NA's   :20     
##  Caffeine.consumption Alcohol.consumption Smoking.status     Exercise.frequency
##  Min.   :  0.00       Min.   :0.000       Length:452         Min.   :0.000     
##  1st Qu.:  0.00       1st Qu.:0.000       Class :character   1st Qu.:0.000     
##  Median : 25.00       Median :0.000       Mode  :character   Median :2.000     
##  Mean   : 23.65       Mean   :1.174                          Mean   :1.791     
##  3rd Qu.: 50.00       3rd Qu.:2.000                          3rd Qu.:3.000     
##  Max.   :200.00       Max.   :5.000                          Max.   :5.000     
##  NA's   :25           NA's   :14                             NA's   :6

Running these lines of code gives a general overview of our data. The summary reveals that there are missing values in our data, but don’t worry! This is quite a common occurrence for data sets, whether they are large or small.

Tidying the Raw Data

Addressing Missing Values

In a sleep study setting, it’s common to encounter missing or inaccurate data as a consequence of technical issues and human error. We can choose either to omit or to impute missing values, depending on their proportion relative to the entire data set and the distribution of the affected predictors. Let’s explore their rough distribution with a missing-values plot!

# Missing value plot from 'finalfit' package
sleep %>%
  missing_plot()

Notice that only 4 of the predictors have missing values. Let’s run this next chunk to figure out exactly how many missing values we’re dealing with:

colSums(is.na(sleep))
##                     ID                    Age                 Gender 
##                      0                      0                      0 
##                Bedtime            Wakeup.time         Sleep.duration 
##                      0                      0                      0 
##       Sleep.efficiency   REM.sleep.percentage  Deep.sleep.percentage 
##                      0                      0                      0 
## Light.sleep.percentage             Awakenings   Caffeine.consumption 
##                      0                     20                     25 
##    Alcohol.consumption         Smoking.status     Exercise.frequency 
##                     14                      0                      6

Awakenings, Caffeine.consumption, Alcohol.consumption, and Exercise.frequency have 20, 25, 14, and 6 missing values, respectively… Not too many after all! However, these predictors aren’t good candidates for linear imputation because their relationships with the other predictors won’t necessarily follow a linear trend. Mean or median imputation is questionable as well, since the values may not be missing completely at random, and these count-style variables are heavily skewed (the non-missing observations in Awakenings, for instance, cluster at small whole numbers between 0 and 4).

Therefore, let’s opt to remove the missing values and update the data to reflect this adjustment!

sleep <- sleep %>% 
  drop_na()

Now, we’re interested in understanding the scale of the updated data set.

# Call dim() to see # of rows & columns
dim(sleep)
## [1] 388  15
# Viewing the variable names
names(sleep)
##  [1] "ID"                     "Age"                    "Gender"                
##  [4] "Bedtime"                "Wakeup.time"            "Sleep.duration"        
##  [7] "Sleep.efficiency"       "REM.sleep.percentage"   "Deep.sleep.percentage" 
## [10] "Light.sleep.percentage" "Awakenings"             "Caffeine.consumption"  
## [13] "Alcohol.consumption"    "Smoking.status"         "Exercise.frequency"

Upon removing missing values, our data set contains 388 rows of observations and 15 columns: the outcome Sleep.efficiency, the ID column, and 13 remaining predictor variables.

Cleaning the Data

We will perform the following cleaning procedures to remove unnecessary variables and convert any categorical predictor variables into factors. Let’s remove ID, which is simply a unique number assigned to each sleep study participant for identification purposes.

# Removing the variable "ID"
sleep <- sleep[, !(names(sleep) %in% "ID")]

# Converting 'Gender' 'Smoking.status' to factors
sleep$Gender <- as.factor(sleep$Gender)
sleep$Smoking.status <- as.factor(sleep$Smoking.status)

Adding Predictors

Of the 388 observed participants, ages range anywhere from 9 to 69 years old. Age plays a large role in human sleep patterns, as well as in certain lifestyle choices. Let’s create a new predictor from Age to categorize each person into one of four age demographics: child, youth, adult, and senior! The age-demographic cutoffs are sourced from the National Institutes of Health (NIH).

# Specifying the age breaks
age.cutoff.criteria <- c(0, 12, 18, 59, Inf)
# Creating new predictor specifying age group
# A factor with 4 levels
sleep$Age.group <- cut(sleep$Age, breaks = age.cutoff.criteria, 
                       labels = c("Child", "Youth", "Adult", "Senior"), 
                       include.lowest = TRUE)
sleep %>% 
  ggplot(aes(x=Age.group)) +
  geom_bar()

The variables Bedtime and Wakeup.time are currently character strings, both formatted as the date (YYYY-MM-DD) followed by the time (HR:MIN:SEC). Our focus is on the bedtime and wakeup times themselves, yet the current format is difficult to use in a predictive setting. Thus, we will strip the date (which isn’t useful for our purposes) from the strings and convert both predictors to a numeric number of hours.

# Date-time conversion allows us to manipulate Bedtime & Wakeup.time
sleep$Bedtime <- as.POSIXct(sleep$Bedtime,
                        format = "%Y-%m-%d %H:%M:%S")
sleep$Wakeup.time <- as.POSIXct(sleep$Wakeup.time,
                        format = "%Y-%m-%d %H:%M:%S")

# Extract hour() & minute() -- from 'lubridate' package
Bedtime.hour <- hour(sleep$Bedtime)
Bedtime.mins <- minute(sleep$Bedtime)
Wakeup.hour <- hour(sleep$Wakeup.time)
Wakeup.mins <- minute(sleep$Wakeup.time) 

# Update Bedtime, Wakeup.time to a numeric form (hours)
sleep$Bedtime <- Bedtime.hour + (Bedtime.mins / 60)
sleep$Wakeup.time <- Wakeup.hour + (Wakeup.mins / 60)

Before delving into each of the predictors, let’s perform a quick sanity check:

str(sleep)
## 'data.frame':    388 obs. of  15 variables:
##  $ Age                   : int  65 69 40 40 57 27 53 41 11 50 ...
##  $ Gender                : Factor w/ 2 levels "Female","Male": 1 2 1 1 2 1 2 1 1 2 ...
##  $ Bedtime               : num  1 2 21.5 2.5 1 21 0.5 2.5 1 0.5 ...
##  $ Wakeup.time           : num  7 9 5.5 8.5 9 3 10.5 8.5 10 8.5 ...
##  $ Sleep.duration        : num  6 7 8 6 8 6 10 6 9 8 ...
##  $ Sleep.efficiency      : num  0.88 0.66 0.89 0.51 0.76 0.54 0.9 0.79 0.55 0.92 ...
##  $ REM.sleep.percentage  : int  18 19 20 23 27 28 28 28 18 23 ...
##  $ Deep.sleep.percentage : int  70 28 70 25 55 25 52 55 37 57 ...
##  $ Light.sleep.percentage: int  12 53 10 52 18 47 20 17 45 20 ...
##  $ Awakenings            : num  0 3 1 3 3 2 0 3 4 1 ...
##  $ Caffeine.consumption  : num  0 0 0 50 0 50 50 50 0 50 ...
##  $ Alcohol.consumption   : num  0 3 0 5 3 0 0 0 0 0 ...
##  $ Smoking.status        : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 2 1 1 2 ...
##  $ Exercise.frequency    : num  3 3 3 1 3 1 3 1 0 3 ...
##  $ Age.group             : Factor w/ 4 levels "Child","Youth",..: 4 4 3 3 3 3 3 3 1 3 ...

Nice! We can confirm that all the variables are prepared in the correct format for our recipe.

Explaining the Predictors

Upon analyzing and tidying our raw data, we have removed unnecessary variables and converted our predictors into appropriate formats to yield a clearer analysis. Now we are ready for our visual exploratory analysis! Here are the predictor variables we plan to use in our model recipe to predict sleep efficiency scores, each followed by a brief description:

- Age : Age of the test subject, as an integer
- Gender : Gender of the test subject (male/female), converted to a factor
- Bedtime : Time at which the subject goes to bed each night; originally a character string (HR:MIN:SEC), now converted to a numeric hour ranging over [0,23]
- Wakeup.time : Time at which the subject wakes up each morning; originally a character string, now converted to a numeric hour (observed range [3,12])
- Sleep.duration : The total number of hours the subject slept, as a double
- REM.sleep.percentage : Percentage of total sleep time spent in REM sleep, as an integer ranging over [0,100]
- Deep.sleep.percentage : Percentage of total sleep time spent in deep sleep, as an integer ranging over [0,100]
- Light.sleep.percentage : Percentage of total sleep time spent in light sleep, as an integer
- Awakenings : The average number of times the subject wakes up during the night, as a double
- Caffeine.consumption : Amount of caffeine (in milligrams) consumed in the 24 hours prior to bedtime
- Alcohol.consumption : Amount of alcohol (in ounces) consumed in the 24 hours prior to bedtime
- Smoking.status : Whether or not the subject smokes (Yes/No), converted to a factor
- Exercise.frequency : The average number of times the subject exercises each week, as a double
- Age.group : A factor with 4 levels, categorizing each participant into one of four age demographics

Visual EDA

To gain some understanding of how the response variable relates to its predictors, we’ll generate a simple distribution plot and a correlation matrix to identify possible relationships among variables. Then, we’ll create several visualizations to observe the impact of the predictors of interest on our outcome variable.

Sleep Efficiency Distribution

Let’s first explore the distribution of our outcome variable Sleep.efficiency.

sleep %>%
  ggplot(aes(x=Sleep.efficiency)) + 
  geom_bar(aes(fill = Age.group)) + 
  labs(x = "Sleep Efficiency Score", y = "# of Occurrences Per Score", 
       title = "Distribution of Sleep Efficiency Scores Per Age Group") +
  theme_bw() 

Sleep efficiency scores can lie anywhere from 0 to 1. In our data set, we observe that Sleep.efficiency ranges from 0.5 to 0.99. This is quite intuitive: a score of zero would imply that the subject did not sleep at all. Looking at the bar chart’s peak, the most frequent score is 0.9. An overwhelming majority of our observations fall into the adult age group, so it may be wise not to introduce this categorical variable into our model setup…

Variable Correlation Plot

Next, let’s create a correlation matrix of all our numeric variables which we’ll use to produce a correlation heat map. This map will allow us to visually detect correlations among our predictors and the outcome, Sleep.efficiency.

# Extract numeric predictors
sleep.numer <- sleep %>%
  select(where(is.numeric)) 
# Sleep correlation plot
sleep.heatplot <- corrplot(cor(sleep.numer), order = 'AOE', 
                           addCoef.col = 1, number.cex = 0.3)

Surprising… There aren’t as many relationships as expected!

First, it’s obvious that the pairs Light.sleep.percentage/Deep.sleep.percentage and Bedtime/Wakeup.time are strongly negatively correlated, since each pair measures opposite (or complementary) quantities. The pairs have correlation scores of -0.98 and -0.77, respectively, which would most likely introduce multicollinearity into our model. In addition, I realized that Age.group may be useful for visual EDA, but will probably introduce redundancy if used in combination with the Age predictor. Let’s resolve this problem by dropping one variable from each conflicting pair, as well as Age.group, when formulating our recipe.

Sleep.duration and Wakeup.time also exhibit a noteworthy positive correlation since waking up at a later hour increases the duration of sleep. It seems that Alcohol.consumption has a mild positive correlation with Light.sleep.percentage and a mild negative correlation with Deep.sleep.percentage, potentially because alcohol is a depressant known to disrupt normal sleep cycles. I suspect that Wakeup.time and Exercise.frequency share a moderately negative correlation because individuals who exercise more often may tend to wake up earlier in the day to have enough time for a workout session.

I was shocked that age, alcohol and caffeine consumption, exercise frequency, and light and deep sleep percentage had little to no overall correlation with Sleep.duration or REM.sleep.percentage, yet exhibited their predicted relationships with Sleep.efficiency.

Sleep Duration

The heat-map plot exhibits a -0.02 correlation score between Sleep.duration and Sleep.efficiency. It seems pretty intuitive to me that sleeping longer would have some positive relationship with overall sleep efficiency, so I was surprised that the score wasn’t even slightly positive. Let’s further investigate the relationship between the two variables with a simple barplot.

# Simple barplot of grouped durations against sleep scores 
sleep %>%
  ggplot(aes(x = as.factor(Sleep.duration), y = Sleep.efficiency)) + 
  geom_bar(stat = "summary", fun = "mean") +
  labs(title = "Average Sleep Efficiency by Sleep Duration",
       x = "Sleep Duration", y = "Sleep Efficiency Score") +
  theme_minimal()

The roughly equal bar heights suggest that across various sleep durations, the corresponding sleep efficiency scores don’t vary much. I’m surprised because I had always assumed that more hours of sleep would positively correlate to better quality of sleep.

Gender and Smoking Status

sleep %>%
  ggplot(aes(x = Sleep.efficiency, y = Gender, fill = Smoking.status)) +
  geom_boxplot() +
  labs(title = "Boxplot of Sleep Efficiency vs Gender & Smoking Status",
       x = "Sleep Efficiency Score",
       y = "Gender") +
  scale_fill_manual(name = "Smoking Status", 
                    values = c(No = "#FF9966", Yes = "#33CC99")) +
  theme_bw()

From the grouped boxplot above, we initially see that the median sleep efficiency of non-smokers is higher than that of smokers, regardless of gender. In addition, the scores are more tightly clustered for non-smokers and more spread out amongst smokers. For women, the median score of smokers is lower than the first quartile of non-smokers. If we look at the maxima (the far right ends of the whiskers), we see that the highest observed sleep efficiency scores belong to non-smokers, both male and female. Among female smokers, the interquartile range extends widely, suggesting that the impact of smoking varies considerably across the middle half of this sample.

Alcohol Consumption

sleep %>%
  # Points & approximate line for 'Alcohol.consumption'
  ggplot(aes(x = Alcohol.consumption, y = Sleep.efficiency)) + 
  geom_jitter(aes(color = "Alcohol"), size = 0.4, width = 0.5) + 
  geom_smooth(aes(color = "Alcohol"), method = "lm", show.legend = TRUE)  +
  
  # Points & approximate line for 'Exercise.frequency'
  geom_jitter(aes(x = Exercise.frequency, color = "Exercise"), 
              size = 0.4, width = 0.5) +
  geom_smooth(aes(x = Exercise.frequency, color = "Exercise"), 
              method = "lm", show.legend = TRUE) +
  
  labs(x = "Alcohol Consumption (oz/day) & Weekly Exercise Frequency",
       y = "Sleep Efficiency Score",
       title = "Sleep Efficiency vs Alcohol Consumption & Exercise") + 
  scale_color_manual(name = "Predictors", 
                     values = c(Alcohol = "firebrick2", 
                                Exercise = "springgreen4")) 
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

After removing the observations with missing values, it becomes apparent that Alcohol.consumption and Exercise.frequency exhibit opposing trends when plotted against the outcome. Although both predictors’ observation points are quite randomly scattered, we used the geom_smooth() function to approximate slight linear patterns. As expected, increased daily alcohol consumption corresponds to lower sleep efficiency, while more frequent exercise corresponds to higher sleep efficiency scores.

Setting Up Our Models

Now that we have explored potential relationships between our variables and gauged which predictors might have the most impact on sleep efficiency scores, we can finally begin setting up our predictive model step by step! This process starts with splitting our data into random train/test sets, then developing a universal recipe, and finally creating k folds for cross-validation.

Train-Test Split

Before moving forward, we must split our data into two subsets: the training set and the testing set. A 70/30 train-test ratio is pretty standard, so let’s use a similar proportion to split our sleep data. It’s essential to split the data this way so we aren’t at risk of over-fitting our model. We use the training set first because it teaches machine learning models to pick up on important patterns and relationships within the data. Usually, the model with the lowest root-mean-squared error (RMSE) is considered the “best” model. Once we determine a best model, we fit it to the testing set (left untouched until this later step) to evaluate how well it performs on an entirely new collection of data.

To keep our results reproducible throughout the project, I’ll go ahead and set a random seed. Next, I will create my train-test split, stratifying on the outcome variable to ensure that each subset contains a well-representative distribution of Sleep.efficiency.

set.seed(123)
# Split the data, stratifying on `Sleep.efficiency`
sleep_split <- initial_split(sleep, prop = 0.75, strata = Sleep.efficiency)
sleep_train <- training(sleep_split)
sleep_test <- testing(sleep_split)

I ultimately decided on a 75/25 train-test split to reserve even more data to train my model on, while also leaving enough observations to test on. Before starting on the recipe, let’s run the following code chunks as a sanity check to verify we split the data correctly…

count(sleep_train)/nrow(sleep)
##           n
## 1 0.7474227
count(sleep_test)/nrow(sleep)
##           n
## 1 0.2525773

Perfect! The training set contains approximately 75% of our total sample, and the testing set contains the remaining 25%.

Recipe Creation

Over the course of our analysis, we will keep using the same predictors, model conditions, and response variable. Therefore, we will write one universal recipe to preprocess the data, which we’ll apply within various model setups and training procedures. Each of the models we choose will utilize this recipe through the specific methods associated with that type of model.

We will be using 11 out of 14 total predictors that we described earlier: Age, Gender, Bedtime, Sleep.duration, REM.sleep.percentage, Deep.sleep.percentage, Awakenings, Caffeine.consumption, Alcohol.consumption, Smoking.status, and Exercise.frequency.

Although they provided helpful insights during our EDA, note that we will not consider the predictors Light.sleep.percentage, Wakeup.time, and Age.group for the sake of avoiding redundancy and risk of multi-collinearity.

We already dealt with missingness way earlier! To recap, we concluded it was more appropriate to remove the missing values rather than impute them based on the nature/distribution of Awakenings, Caffeine.consumption, Alcohol.consumption, and Exercise.frequency.

We’ll then dummy-encode our two categorical variables, Gender and Smoking.status. Finally, let’s make sure to normalize our data by centering and scaling all predictors. This step will enhance the interpretability and performance of our models.

sleep_recipe <- recipe(Sleep.efficiency ~ Age + Gender + Bedtime +
                         Sleep.duration + REM.sleep.percentage + 
                         Deep.sleep.percentage + Awakenings +
                         Caffeine.consumption + Alcohol.consumption +
                         Smoking.status + Exercise.frequency, 
                       data = sleep_train) %>%
  # STEP 1: Dummy-encode the nominal predictors
  step_dummy(Gender, Smoking.status) %>%
  # STEP 2: Scale all predictors (divide by standard deviation)
  step_scale(all_predictors()) %>%
  # STEP 3: Center all predictors (subtract the mean)
  step_center(all_predictors()) 

K-Fold Cross Validation

To better evaluate our model’s accuracy and prevent potential over-fitting, we can turn to k-fold cross-validation, which is more rigorous than a simple train-test split.

This method randomly splits the training data into k equally-sized subsets called “folds”. For each iteration of the process, R reserves one fold as the validation set and assigns the remaining k - 1 folds as the training set. It then fits the specified model to the training folds and tests it on the held-out fold.

It repeats the above process k times, so that each fold acts as the validation set exactly once; this way, the method is validated on a different fold each time. Lastly, model performance is evaluated by averaging the chosen metric (RMSE, in our case) across the k folds.
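The partitioning mechanics can be sketched with a toy fold assignment (purely illustrative; the project itself relies on rsample’s vfold_cv() for this):

```r
# Toy illustration of k-fold partitioning (the project uses rsample::vfold_cv())
set.seed(123)
n <- 20   # pretend training-set size
k <- 5    # number of folds
fold_id <- sample(rep(1:k, length.out = n))  # randomly assign each row a fold
# For fold j: rows with fold_id == j form the validation set,
# rows with fold_id != j form the training set.
table(fold_id)   # each of the 5 folds holds exactly 20/5 = 4 rows
```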

Let’s choose k = 10: 10 folds is a common choice in a machine learning setting, and is typically rigorous enough to yield a performance estimate with low bias and modest variance. We will train the model on all but one of the folds, validating on the remaining fold; for 10 folds, we repeat the process 10 times.

Before we forget, we should remember to stratify on our outcome variable, Sleep.efficiency, to prevent data imbalances across each of the ten folds.

# Creating our folds
sleep_folds <- vfold_cv(sleep_train, v = 10, strata = Sleep.efficiency)

Model Building

After all this tedious preparation, we’re finally ready to build our models!

Performance Metric

We will use the Root Mean Squared Error (RMSE) as our performance metric because it’s a popular choice for regression tasks such as this project.

The RMSE measures the average deviation between a statistical model’s predictions and the true values. In a mathematical sense, it captures the standard deviation of the residuals (the distances between the actual data points and the regression line). RMSE values are expressed in the same units as the outcome variable and can range anywhere from zero, indicating a perfect fit, to infinity. Thus, the lower the value, the better the fit.
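As a quick sketch of the formula, RMSE can be computed by hand on a few toy values (illustrative numbers, not model output); within tidymodels, yardstick’s rmse() / rmse_vec() perform the same calculation:

```r
# RMSE = sqrt(mean((actual - predicted)^2)) -- toy values for illustration
actual    <- c(0.88, 0.66, 0.89, 0.51)
predicted <- c(0.85, 0.70, 0.80, 0.60)
rmse_by_hand <- sqrt(mean((actual - predicted)^2))
round(rmse_by_hand, 3)   # 0.068, in the same units as Sleep.efficiency
```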

It’s a good thing we already normalized our data within sleep_recipe because RMSE is a measure of distance. We can apply it as a universal and easily-interpretable metric for all our models!

Fitting Models

We will create and test out several different techniques that we’ve learned throughout the course, PSTAT131 (Statistical Machine Learning). Each model will follow the same universal sleep_recipe that we created earlier. Once we fit and evaluate our 5 different models, we’ll choose the best 2 models based on lowest RMSE. Then we’ll analyze those two even further!

To fit each of our five models, we will follow a fairly straightforward and consistent series of steps detailed below:

1) Set up each model: Specify the type of model you are going to fit, the tuning parameters (if applicable), the mode (always “regression” for this project), and finally the engine that the model is sourced from.

# LINEAR REGRESSION 
lm_model <- linear_reg() %>% 
  set_mode("regression") %>%  # the default for linear_reg(); stated for consistency
  set_engine("lm")

# RIDGE REGRESSION
ridge_model <- linear_reg(mixture = 0,
                         penalty = tune()) %>%
  set_mode("regression") %>%
  set_engine("glmnet")

# K-NEAREST NEIGHBORS 
knn_model <- nearest_neighbor(neighbors = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("kknn")

# RANDOM FOREST 
rf_model <- rand_forest(trees = tune(), 
                        mtry = tune(), 
                        min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger", importance = "impurity")

# GRADIENT BOOSTED TREE
boost_tree_model <- boost_tree(trees = tune(),
                               learn_rate = tune(),
                               min_n = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression") 

2) Set up each workflow: Set up a workflow for each model by adding the model and our preexisting sleep_recipe.

# LINEAR REGRESSION 
lm_wf <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(sleep_recipe)

# RIDGE REGRESSION
ridge_wf <- workflow() %>%
  add_model(ridge_model) %>%
  add_recipe(sleep_recipe)

# K-NEAREST NEIGHBORS 
knn_wf <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(sleep_recipe)

# RANDOM FOREST 
rf_wf <- workflow() %>%
  add_model(rf_model) %>%
  add_recipe(sleep_recipe)

# GRADIENT BOOSTED TREE
boost_tree_wf <- workflow() %>%
  add_model(boost_tree_model) %>%
  add_recipe(sleep_recipe)

3) Create tuning grids: For each hyper-parameter we want to tune, specify their ranges and the number of levels. This step will require a few rounds of testing and adjusting before we can strike an appropriate combination of tuning parameters to yield the best fit for each model. Some models (such as linear regression) don’t have any tuning parameters that we need to specify or create a grid for.

# LINEAR REGRESSION -- No tuning parameters

# RIDGE REGRESSION -- note: penalty() treats its range as log10 exponents,
# so this grid explores penalties from 10^0.001 (~1) up to 10^1 (= 10)
penalty_grid <- grid_regular(penalty(range = c(0.001,1)), levels = 50)

# K-NEAREST NEIGHBORS 
knn_grid <- grid_regular(neighbors(range = c(1, 20)), levels = 5)

# RANDOM FOREST 
rf_grid <- grid_regular(mtry(range = c(1,9)),
                        trees(range = c(50, 500)), 
                        min_n(range = c(5, 20)), 
                        levels = 8)

# GRADIENT BOOSTED TREE
boosted_grid <- grid_regular(trees(range = c(10, 500)),
                             learn_rate(range = c(0.01, 0.1),
                                        trans = identity_trans()),
                             min_n(range = c(1, 20)),
                             levels = 5)
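A regular grid crosses every level of every parameter, so the number of candidates grows multiplicatively with each parameter we tune—worth keeping in mind before launching a long tuning run. As a rough base-R illustration of the random forest grid above (grid_regular handles the actual construction for us, including value deduplication):

```r
# A regular grid is the Cartesian product of each parameter's levels
rf_candidates <- expand.grid(
  mtry  = round(seq(1, 9, length.out = 8)),
  trees = round(seq(50, 500, length.out = 8)),
  min_n = round(seq(5, 20, length.out = 8))
)
nrow(rf_candidates)  # 8^3 = 512 candidate parameter combinations

# Each candidate is fit once per fold, so total fits = candidates x folds
nrow(rf_candidates) * 10  # 5120 random forests to train
```

This is why the random forest and boosted tree models take so long to tune compared to the others.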

4) Fitting models: Let’s fit our models to our folded sleep data. To do this, we have to specify the workflow, “K-fold” resamples, and the tuning grid.

# LINEAR REGRESSION -- No tuning
lm_fit <- fit_resamples(lm_wf, resamples = sleep_folds)

# RIDGE REGRESSION
ridge_tune <- tune_grid(
  ridge_wf,
  resamples = sleep_folds,
  grid = penalty_grid
)

# K-NEAREST NEIGHBORS 
knn_tune <- tune_grid(
  knn_wf,
  resamples = sleep_folds,
  grid = knn_grid
)

# RANDOM FOREST 
rf_tune <- tune_grid(
  rf_wf,
  resamples = sleep_folds,
  grid = rf_grid
)
  
# GRADIENT BOOSTED TREE
boosted_tune <- tune_grid(
  boost_tree_wf,
  resamples = sleep_folds,
  grid = boosted_grid
) 

5) Saving RDS Files: Let’s save each of our tuned models to an RDS file so we don’t have to sit through the long tuning runs again.

# RIDGE REGRESSION
write_rds(ridge_tune, file = "ridge.rds")
# K-NEAREST NEIGHBORS 
write_rds(knn_tune, file = "knn.rds")
# RANDOM FOREST 
write_rds(rf_tune, file = "rf.rds")
# GRADIENT BOOSTED TREE
write_rds(boosted_tune, file = "boost.rds")

6) Loading RDS Files: Now, we can load the saved RDS files back into our workspace without the long wait!

# RIDGE REGRESSION
tuned_ridge <- read_rds(file = "ridge.rds")
# K-NEAREST NEIGHBORS 
tuned_knn <- read_rds(file = "knn.rds")
# RANDOM FOREST
tuned_rf <- read_rds(file = "rf.rds")
# GRADIENT BOOSTED TREE
tuned_boost <- read_rds(file = "boost.rds")

7) Collecting metrics from tuned models: To pick the winning configuration for each model, we can use select_by_one_std_err(), which returns the simplest candidate whose RMSE falls within one standard error of the lowest.

# LINEAR REGRESSION
best_lm <- show_best(lm_fit, metric = "rmse")

# RIDGE REGRESSION
best_ridge <- select_by_one_std_err(tuned_ridge,
                                    metric = "rmse",
                                    penalty)

# K-NEAREST NEIGHBORS 
best_knn <- select_by_one_std_err(tuned_knn,
                                  metric = "rmse",
                                  neighbors)

# RANDOM FOREST
best_rf <- select_by_one_std_err(tuned_rf,
                                  metric = "rmse",
                                  mtry, trees, min_n)

# GRADIENT BOOSTED TREE
best_boost <- select_by_one_std_err(tuned_boost,
                                    metric = "rmse",
                                    trees, learn_rate, min_n)
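To be precise, select_by_one_std_err() doesn’t simply grab the configuration with the lowest RMSE: it finds the lowest, then returns the simplest candidate whose RMSE lands within one standard error of it. Here is a toy base-R sketch of the rule, using hypothetical numbers:

```r
# Hypothetical tuning results, one row per candidate,
# already sorted from simplest (row 1) to most complex (row 5)
results <- data.frame(
  complexity = 1:5,
  mean       = c(0.070, 0.062, 0.060, 0.061, 0.063),  # CV RMSE per candidate
  std_err    = c(0.004, 0.003, 0.003, 0.003, 0.004)
)

# Threshold: the best (lowest) RMSE plus its own standard error
best      <- which.min(results$mean)                     # row 3, RMSE 0.060
threshold <- results$mean[best] + results$std_err[best]  # 0.063

# Pick the simplest candidate at or under the threshold
pick <- results[results$mean <= threshold, ][1, ]
pick$complexity  # 2: statistically indistinguishable from the best, but simpler
```

The one-standard-error rule trades a sliver of resampled accuracy for a simpler, less over-fit model.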

Model Results

Now that we’ve used select_by_one_std_err() to pick the best tuned configuration for each of our five models, let’s compare the results. We can return each RMSE in a nicely-formatted tibble, and arrange them in order of best to worst performance.

# Tibble with all the models and their RMSEs
final_comparison_table <- tibble(Model = c("Linear Regression", 
                                           "Ridge Regression", 
                                           "K Nearest Neighbors", 
                                           "Random Forest", "Boosted Trees"), 
                           RMSE = c(best_lm$mean, best_ridge$mean,
                                    best_knn$mean, best_rf$mean,
                                    best_boost$mean))

# Displaying in order of lowest to highest RMSE
final_comparison_table <- final_comparison_table %>%
  arrange(RMSE)

# Returning the tibble results
final_comparison_table
## # A tibble: 5 × 2
##   Model                 RMSE
##   <chr>                <dbl>
## 1 Boosted Trees       0.0496
## 2 Random Forest       0.0498
## 3 Linear Regression   0.0629
## 4 K Nearest Neighbors 0.0662
## 5 Ridge Regression    0.117

Our next task is to visualize our RMSE results, and we can create this representation in the form of a barplot.

# Reusing our comparison tibble as a data frame for plotting
five_models <- as.data.frame(final_comparison_table)
# Bar-plot of the RMSE's
five_models %>%
  ggplot(aes(x = Model, y = RMSE)) +
  geom_bar(aes(fill = Model), stat = "identity") +
  labs(title = "Evaluating Model Performance Via RMSE") +
  theme_classic()

Based on our RMSE tibble and bar-plot, it’s clear that “Random Forest” and “Gradient Boosted Trees” were the two models with the lowest RMSE—and thus, by our criterion, performed the best on the cross-validation data. We should also note that “Ridge Regression” performed the worst by far. In our case, the more complex models outperformed simpler ones such as linear regression, which may suggest that the relationships in our data aren’t fully linear.

Model Autoplots

Auto-plots can help us better understand the impact that tuning parameters have on overall model performance. Let’s continue using RMSE as our metric of choice to visually investigate our best two models.

Random Forest Plot

autoplot(tuned_rf, metric = "rmse")

For Random Forest, we tuned mtry over a range of 1 to 9: the number of predictors randomly sampled as split candidates at each node. Although there are 11 total predictors, we capped the range at 9 to keep some randomness between trees and reduce the risk of over-fitting. Sampling only a subset of the 11 predictors at each split helps decorrelate the trees, so the ensemble captures different relationships between predictors and the outcome. I specified the range of trees as 50-500, but we can see above that the number of trees has little to no discernible impact on performance. It’s a little hard to see at first, but the minimal node size of 5 seems to correspond with slightly lower RMSE values. Across all 8 levels, the plots show that a greater number of randomly selected predictors generally yields a smaller RMSE.

Boosted Trees Plot

autoplot(tuned_boost, metric = "rmse")

The RMSE decreases substantially as the number of trees grows toward a threshold of approximately 100 trees. Below this threshold, a higher learning rate yields better model performance; once we reach ~100 trees, the curves level off and the model performs noticeably better. We tuned the learning rate over a range of 0.01 to 0.1 because too large a value leads to a trade-off: the model learns more quickly, but produces overly specific, over-fitted results. Our goal is to strike a balance between model complexity and generalization. At first glance the minimal node size seems to have little impact on the RMSE, but upon closer inspection, a min_n of 20 performs slightly better than the other sizes.

Best Model Results

Performance on Folds

Great! We’ve now gathered that Random Forest was one of our top two models, essentially tied with Boosted Trees for the lowest RMSE. Let’s investigate it more.

best_rf
## # A tibble: 1 × 11
##    mtry trees min_n .metric .estimator   mean     n std_err .config        .best
##   <int> <int> <int> <chr>   <chr>       <dbl> <int>   <dbl> <chr>          <dbl>
## 1     3    50     5 rmse    standard   0.0498    10 0.00264 Preprocessor… 0.0476
## # ℹ 1 more variable: .bound <dbl>

The parameters that yielded our selected Random Forest model are mtry = 3, trees = 50, and min_n = 5. Its cross-validated RMSE of 0.0498 falls within one standard error of the very best configuration’s 0.0476.

Fit to Training Data

Now that we have our best model, let’s go ahead and re-fit it to the training set. Then, we’ll be able to use it with the testing data.

# Finalizing the workflow with our selected parameters (best_rf)
rf_final_workflow_train <- finalize_workflow(rf_wf, best_rf)
# Fitting to training set
rf_final_fit_train <- fit(rf_final_workflow_train, data = sleep_train)
# Saving to RDS file
write_rds(rf_final_fit_train, file = "rf_final_train.rds")

Testing Our Model

Lastly, it’s time to test our model on some new data. Let’s use it to predict on a set of completely unseen observations, sleep_test!

# Reloading the fitted training data
rf_final_fit_train <- read_rds(file = "rf_final_train.rds")

# Making a tibble of predicted and actual values
best_model_tibble <- predict(rf_final_fit_train, 
                             new_data = sleep_test %>%
                               select(-Sleep.efficiency))
best_model_tibble <- bind_cols(best_model_tibble, 
                               sleep_test %>% select(Sleep.efficiency))

# Saving to RDS file
write_rds(best_model_tibble, file = "final_model.rds")
# Reloading the final model
best_model_tibble <- read_rds(file = "final_model.rds")

# Naming RMSE as official metric
sleep_metric <- metric_set(rmse)

# Gathering RMSE from testing set
best_model_tibble_metrics <- sleep_metric(best_model_tibble,
                                          truth = Sleep.efficiency,
                                          estimate = .pred)
best_model_tibble_metrics
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.0496

Interesting! Our Random Forest model performed essentially the same on the testing set as on the training set’s cross-validation folds: an RMSE of 0.0498 across the folds versus 0.0496 on the testing set. As described in the beginning, the RMSE metric is expressed in the outcome’s original units; in our case, a proportion that ranges from 0 to 1. That being said, an RMSE of about 0.05 is pretty decent—not extremely small, but not large by any means. It’s safe to say the model performed pretty well!

Next, we should visually compare the actual values against our model’s predictions.

# Scatterplot of prediction V actual 
best_model_tibble %>%
  ggplot(aes(x = .pred, y = Sleep.efficiency)) +
  geom_point(alpha = 0.5) +
  geom_abline(lty = 3) +
  coord_obs_pred() +
  labs(title = "Predicted Values vs. Actual Values") +
  theme_minimal()

The points form a vaguely straight line, although there are many outliers. The trend isn’t as tight or defined as anticipated, likely because the testing set contains only about a quarter of our 388 total observations. Plenty of points stray from the line, which is surprising given our fairly low RMSE, but I think our model performed decently overall!

Conclusion

In conclusion, I gathered that the strongest approach for predicting sleep efficiency scores is a tree ensemble: Random Forest, which essentially tied Boosted Trees for the lowest RMSE. The Random Forest model is good at capturing complex relationships, and also lets us gauge the relative importance of each predictor for the response variable. The RMSE we got was fair… It could have been smaller than 0.05, but the model didn’t perform horribly either.

Both Random Forest and Boosted Trees performed better than our 3 other models. This result indicates that the response most likely follows a non-linear form that these flexible, non-parametric tree ensembles can capture. In other words, sleep efficiency is determined by a non-linear combination of the predictor variables we decided to include in our recipe. So, it makes sense that the Linear Regression method did not yield a very low RMSE.

K-Nearest Neighbors provided an even worse estimate, potentially because the method is known to struggle in high-dimensional spaces with few observations. In our case, we included 11 predictors across only 388 total data points, which left the observations too spread apart for KNN to predict with high accuracy.

The worst model by far was Ridge Regression, because it assumes a linear pattern between the predictors and the outcome. By nature, this model is simple and thus at risk of under-fitting the data when the penalty is too high; since penalty() interprets its range on the log10 scale, our grid only explored fairly large penalties, which may have shrunk the coefficients too aggressively. There may also have been complex or multi-collinear relationships that we failed to capture in our EDA, and Ridge Regression doesn’t handle these well.

In the future, we could adjust the formulation of the Random Forest model to generate even more accurate predictions of sleep efficiency. The cross-validation and testing RMSEs were nearly identical, which suggests the 75/25 train-test split we chose generalized well. 50 trees yielded the best performance, so our data probably isn’t complex enough to need anywhere near the 500 trees at the top of our tuned range; we could have narrowed the range of the trees parameter and explored its exact effect more rigorously.

A potential question is whether we should have chosen to keep Light.sleep.percentage as a predictor in our recipe, or if it would’ve been wiser to convert Deep.sleep.percentage, Light.sleep.percentage, and REM.sleep.percentage into categorical variables with the levels “yes” and “no”. Maybe we should have converted Bedtime and Wakeup.time into factors categorizing whether the time was early versus late. In addition, I would like to explore if the project were more suitable as a classification problem that would require us to establish a few cut-off criteria associated with different sleep efficiency percentages. Then, interpreting the significance of the outcome variable’s values may become more straightforward.

If we were to continue enhancing this predictive model, it would be helpful to gather more observations and explore more predictors related to sleep patterns and general health, because so many factors contribute to the complex science of human sleep. Here are a few ideas I have: perceived stress levels, smartphone usage, socioeconomic status, hormonal levels based on sex, medication usage, and recreational drug usage (especially marijuana). I think the last would be especially intriguing, since marijuana is a very commonly-used substance among college students, yet its usage is known to decrease the amount of time spent in REM sleep. A completely new project idea would be predicting one’s perceived happiness based on sleep efficiency and all the lifestyle predictors above.

On the whole, this Statistical Machine Learning project taught me how to implement the methods we learned throughout the quarter and apply them to a topic that I’ve been personally interested in exploring. Although the process was quite time-consuming and challenging, I learned how to troubleshoot independently and gained a lot more confidence in approaching an individual project!

Sources

I sourced the sleep_efficiency.csv file for this project off of Kaggle, which you can easily access here: https://www.kaggle.com/datasets/equilibriumm/sleep-efficiency/data. The data was scraped from a sleep study conducted in Morocco by a group of AI engineering students from ENSIAS, and uploaded by the user EQUILIBRIUMM onto Kaggle for public usage. The original study aimed to investigate the impact of certain lifestyle factors on sleep quality and sleep patterns.